visual hallucination
Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency
Morbiato, Filippo, Romano, Luca, Persona, Alessandro
Visual hallucination, where Multimodal Large Language Models fabricate details inconsistent with image content, critically undermines their reliability. Existing fine-tuning methods offer limited improvement, failing to deeply intervene in factual reasoning. This paper introduces Grounded Visual Factualization (GVF) Finetuning, a novel approach to systematically enhance MLLM visual factual consistency. GVF integrates explicit factual signals via three core mechanisms: Factual Anchor Data Augmentation, enriching training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, embedding these cues into explicit instructions; and a Factual Consistency Loss function, specifically penalizing factual inaccuracies. Evaluated on LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark for both Open-Ended Question (OEQ) and Yes/No Question (YNQ) formats. Crucially, GVF maintains or even slightly improves performance on general multimodal benchmarks like MME and POPE, demonstrating effective mitigation of visual hallucinations without compromising general understanding and reasoning abilities.
Modeling Visual Hallucination: A Generative Adversarial Network Framework
Zareh, Masoumeh, Manshaei, Mohammad Hossein, Zahabi, Sayed Jalal, Krunz, Marwan
Visual hallucination refers to the perception of recognizable things that are not present. These phenomena are commonly linked to a range of neurological/psychiatric disorders. Despite ongoing research, the mechanisms through which the visual system generates hallucinations from real-world environments are still not well understood. Abnormal interactions between different regions of the brain responsible for perception are known to contribute to the occurrence of visual hallucinations. In this study, we propose and extend a generative neural network-based framework to address challenges within the visual system, aiming to create goal-driven models inspired by neurobiological mechanisms of visual hallucinations. We focus on the adversarial interactions between the visual system and the frontal lobe regions, proposing the Hallu-GAN model to suggest how these interactions can give rise to visual hallucinations. The architecture of the Hallu-GAN model is based on generative adversarial networks. Our simulation results indicate that disturbances in the ventral stream can lead to visual hallucinations. To further analyze the impact of other brain regions on the visual system, we extend the Hallu-GAN model by adding EEG data from individuals. This extended model, referred to as Hallu-GAN+, enables the examination of both hallucinating and non-hallucinating states. By training the Hallu-GAN+ model with EEG data from an individual with Charles Bonnet syndrome, we demonstrated its utility in analyzing the behavior of those experiencing hallucinations. Our simulation results confirmed the capability of the proposed model in resembling the visual system in both healthy and hallucinating states.
Mapping of Subjective Accounts into Interpreted Clusters (MOSAIC): Topic Modelling and LLM applied to Stroboscopic Phenomenology
Beauté, Romy, Schwartzman, David J., Dumas, Guillaume, Crook, Jennifer, Macpherson, Fiona, Barrett, Adam B., Seth, Anil K.
Stroboscopic light stimulation (SLS) on closed eyes typically induces simple visual hallucinations (VHs), characterised by vivid, geometric and colourful patterns. A dataset of 862 sentences, extracted from 422 open subjective reports, was recently compiled as part of the Dreamachine programme (Collective Act, 2022), an immersive multisensory experience that combines SLS and spatial sound in a collective setting. Although open reports extend the range of reportable phenomenology, their analysis presents significant challenges, particularly in systematically identifying patterns. To address this challenge, we implemented a data-driven approach leveraging Large Language Models and Topic Modelling to uncover and interpret latent experiential topics directly from the Dreamachine's text-based reports. Our analysis confirmed the presence of simple VHs typically documented in scientific studies of SLS, while also revealing experiences of altered states of consciousness and complex hallucinations. Building on these findings, our computational approach expands the systematic study of subjective experience by enabling data-driven analyses of open-ended phenomenological reports, capturing experiences not readily identified through standard questionnaires. By revealing rich and multifaceted aspects of experiences, our study broadens our understanding of stroboscopically-induced phenomena while highlighting the potential of Natural Language Processing and Large Language Models in the emerging field of computational (neuro)phenomenology. More generally, this approach provides a practically applicable methodology for uncovering subtle hidden patterns of subjective experience across diverse research domains.
Visual Hallucination: Definition, Quantification, and Prescriptive Remediations
Rani, Anku, Rawte, Vipula, Sharma, Harshad, Anand, Neeraj, Rajbangshi, Krishnav, Sheth, Amit, Das, Amitava
The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI. In recent times, considerable research has focused on detecting and mitigating hallucination in Large Language Models (LLMs). However, it's worth noting that hallucination is also quite prevalent in Vision-Language models (VLMs). In this paper, we offer a fine-grained discourse on profiling VLM hallucination based on two tasks: i) image captioning, and ii) Visual Question Answering (VQA). We delineate eight fine-grained orientations of visual hallucination: i) Contextual Guessing, ii) Identity Incongruity, iii) Geographical Erratum, iv) Visual Illusion, v) Gender Anomaly, vi) VLM as Classifier, vii) Wrong Reading, and viii) Numeric Discrepancy. We curate Visual HallucInation eLiciTation (VHILT), a publicly available dataset comprising 2,000 samples generated using eight VLMs across two tasks of captioning and VQA along with human annotations for the categories as mentioned earlier.
Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning
Kim, Bumsoo, Shin, Wonseop, Lee, Kyuchul, Seo, Sanghyun
Large-scale Text-to-Image (TTI) models have become a common approach for generating training data in various generative fields. However, visual hallucinations, which contain perceptually critical defects, remain a concern, especially in non-photorealistic styles like cartoon characters. We propose a novel visual hallucination detection system for cartoon character images generated by TTI models. Our approach leverages pose-aware in-context visual learning (PA-ICVL) with Vision-Language Models (VLMs), utilizing both RGB images and pose information. By incorporating pose guidance from a fine-tuned pose estimator, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. This research advances TTI models by mitigating visual hallucinations, expanding their potential in non-photorealistic domains.
A machine-learning method hallucinates its way to better text translation
As babies, we babble and imitate our way to learning languages. We don't start off reading raw text, which requires fundamental knowledge and understanding about the world, as well as the advanced ability to interpret and infer descriptions and relationships. Rather, humans begin our language journey slowly, by pointing and interacting with our environment, basing our words and perceiving their meaning through the context of the physical and social world. Eventually, we can craft full sentences to communicate complex ideas. Similarly, when humans begin learning and translating into another language, the incorporation of other sensory information, like multimedia, paired with the new and unfamiliar words, like flashcards with images, improves language acquisition and retention. Then, with enough practice, humans can accurately translate new, unseen sentences in context without the accompanying media; however, imagining a picture based on the original text helps.
Hallucinating to better text translation
As babies, we babble and imitate our way to learning languages. We don't start off reading raw text, which requires fundamental knowledge and understanding about the world, as well as the advanced ability to interpret and infer descriptions and relationships. Rather, humans begin our language journey slowly, by pointing and interacting with our environment, basing our words and perceiving their meaning through the context of the physical and social world. Eventually, we can craft full sentences to communicate complex ideas. Similarly, when humans begin learning and translating into another language, the incorporation of other sensory information, like multimedia, paired with the new and unfamiliar words, like flashcards with images, improves language acquisition and retention. Then, with enough practice, humans can accurately translate new, unseen sentences in context without the accompanying media; however, imagining a picture based on the original text helps.
What causes hallucinations: Mouse study uncovers unexpected changes in the brain's signaling
Scientists may finally be close to understanding the mechanisms behind hallucination. Despite the dramatic perceptual changes that take place during a bout of hallucination, what exactly happens in the brain during these moments has long been a mystery. A new study on drugged mice has found that the phenomena associated with these episodes may be the result of reduced signaling in the visual cortex – not an increase, as was expected. The images revealed that even after taking the drug, the signals being sent were largely similar to those seen in its absence, indicating that the information itself does not change. 'You might expect visual hallucinations would result from neurons in the brain firing like crazy, or by mismatched signals,' says senior author Cris Niell, an associate professor and member of the Institute of Neuroscience at the University of Oregon.
'Hallucination machine' gives drug-free psychedelic trip
A'hallucination machine' that sends your brain on a psychedelic trip without the need for drugs has been developed by scientists. Using Google Artificial Intelligence and a virtual reality headset, the device makes users hallucinate as if they have taken LSD or magic mushrooms. The machine was developed to help researchers better understand how the brain responds to altering realities. Brain scans taken on people using the machine could help determine if our'reality' is just a type of hallucination, the researchers claim. Through a virtual reality headset, the hallucination machine repeatedly shows selected images and patterns, such as a dog (top right) or colourful lines (bottom left) and spirals (bottom right) layered over reality.